Taxonomic Assignment
What is a taxonomic assignment?
A taxonomic assignment is a process of assigning an Operational Taxonomic Unit (OTU, that is, groups of related individuals) to sequences that can be reads or contigs. Sequences are compared against a database constructed using complete genomes. When a sequence finds a good enough match in the database, it is assigned to the corresponding OTU. The comparison can be made in different ways.
Strategies for taxonomic assignment
There are many programs for doing taxonomic mapping, and almost all of them follow one of the following strategies:
BLAST: Using BLAST or DIAMOND, these mappers search for the most likely hit for each sequence within a database of genomes (i.e. mapping). This strategy is slow.
Markers: They look for markers within the sequences, that correspond to those markers within a database that are associated with particular taxonomies. This allows the sequences to be classified and assigned a particular taxonomy depending on the hits obtained.
K-mers: A genome database is broken into pieces of length k to be able to search for unique pieces by taxonomic group, from a lowest common ancestor (LCA), passing through phylum to species. Then, the algorithm breaks the query sequence (reads/contigs) into pieces of length k, looks for where these are placed within the tree and make the classification with the most probable position.
Abundance bias
When you do the taxonomic assignment of metagenomes, a key result is the abundance of each taxon or OTU in your sample. The absolute abundance of a taxon is the number of sequences (reads or contigs, depending on what you did) assigned to it. Moreover, its relative abundance is the proportion of sequences assigned to it. It is essential to be aware of the many biases that can skew the abundances along the metagenomics workflow (see figure below) and that because of them, we may not be obtaining the actual abundance of the organisms in the sample.
Using Kraken 2
Kraken 2 is the newest version of Kraken, a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.
When looking at the settings for running Kraken2 we can see that in addition to our input files, we also need to select a database to compare them. The database you use will determine the result you get for your data. Imagine you are searching for a recently discovered lineage that is not part of the available databases. Would you find it?
There are several databases compatible to be used with Kraken2 in the taxonomical assignment process.
Therefore, when running Kraken2 we need to give the database we use serious consideration.
To use Kraken2 our input is the Unique.seqs fasta output, we can use default parameters but under the Create Report tab, you want to select “Yes” for the --report option. For this step, try using two different databases you think would work with your sample.
Using B_Sample_98 as our example, we can examine the Kraken2 Classification table:
| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
|---|---|---|---|---|
| C | 1 | Microbacterium (taxid 43534) | 336 | 3:47 43309:1 3:5 43309:9 43534:9 43490:33 3:87 1:1 3:5 1:3 3:38 43490:1 3:5 43416:1 3:5 43416:8 43490:1 3:5 43309:2 43490:1 3:2 43416:5 43534:1 3:21 1:5 3:1 |
| C | 2 | Chitinophagaceae (taxid 44006) | 332 | 1:1 3:5 1:3 3:4 43868:5 44006:33 43869:14 3:24 1:23 3:74 46157:2 3:5 46157:43 3:62 |
| C | 3 | Bacteria (taxid 3) | 336 | 3:194 1:1 3:5 1:3 3:2 1:6 3:83 1:2 3:5 1:1 |
This information may need to be clarified. Let us take out our “cheatsheet” to understand some of its components:
| Column example | Description |
|---|---|
| C | Classified or unclassified |
| 1 | FASTA header of the sequence |
| Microbacterium (taxid 43534) | Tax ID |
| 336 | Read length |
| 3:47 43309:1 3:5 43309:9 43534:9 43490:33 3:87 1:1 3:5 1:3 3:38 43490:1 3:5 43416:1 3:5 43416:8 43490:1 3:5 43309:2 43490:1 3:2 43416:5 43534:1 3:21 1:5 3:1 | kmers hit to a taxonomic ID e.g. tax ID 43309 has nine hits, tax ID 43416 has eight hits, etc. |
The Kraken2 file could be more readable. So let us look at the Report file:
Below is an explanation regarding this information:
| Column example | Description |
|---|---|
| 78.13 | Percentage of reads covered by the clade rooted at this taxon |
| 587119 | Number of reads covered by the clade rooted at this taxon |
| 587119 | Number of reads assigned directly to this taxon |
| U | A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply ‘-’. |
| 0 | NCBI taxonomy ID |
| unclassified | Indented scientific name |
Visualization of taxonomic assignment results
After we have the taxonomy assignation, what follows is some visualization of our results. Krona is a hierarchical data visualization software. Krona allows data to be explored with zooming and multi-layered pie charts and supports several bioinformatics tools and raw data formats. To use Krona in our results, let us first go into our taxonomy directory, which contains the pre-calculated Kraken outputs.
Krona
With Krona, we will explore the taxonomy of our Kraken Reports. To do this, firs we need to convert our Kraken report to a format compatible with Krona on Galaxy. To do this, search for the tool Krakentools: Convert kraken report file, select Kraken2 on data X: Report and run with default settings.
Once this completes, search for Krona pie chart, change input from Taxonomy to Tabular and select Krakentools: Convert kraken report file on data X. Everything else can be left as default.
Once the job is complete, click the “eye icon” and view the Krona pie chart. You will see a page like this:
Pavian
Pavian is another visualization tool that allows comparison between multiple samples. Pavian should be locally installed and needs R and Shiny, but we can try the Pavian demo WebSite to visualize our results.
First, we need to download the files needed as inputs in Pavian; this time, we will visualize the assignment of the reads of both samples: Kraken2 on data X: Report.
These files correspond to our Kraken reports.
We go to the Pavian demo WebSite, click on Browse, and choose our reports. You need to select both reports at the same time.
We click on the Results Overview tab.
We click on the Sample tab.
We can look at a comparison of both our samples in the Comparison tab.